Abstract We present, as a proof of concept, a way to parallelize the Clifford product
in $C\ell_{p,q}$ for a diagonalized quadratic form as a new procedure cmulWpar in the
CLIFFORD package for Maple®. The procedure uses the new Threads module
available in Maple 15 (and later) and a new CLIFFORD procedure cmulW, which
computes the Clifford product of any two Grassmann monomials in $C\ell_{p,q}$ with the
help of Walsh functions. We benchmark cmulWpar and compare it to two other
procedures, cmulNUM and cmulRS, from CLIFFORD. We comment on how to improve
cmulWpar by taking advantage of multi-core processors and the multithreading
available in modern processors.



1 Introduction 

In [5] we have described how to use (graded) tensor products and periodicity
isomorphisms of real Clifford algebras to accomplish computations in Clifford algebras
over vector spaces of dimensions higher than 8. We accomplish these computations
with the Maple packages CLIFFORD and Bigebra, which have been described
thoroughly in lfT143l[TOl. These packages have proven to be indispensable when deriving
mathematical results presented in, for example, ll6ll9] UTI . Often CLIFFORD and
Bigebra have been used to prepare examples in support of a mathematical theory
in place of hand computations, e.g., [13], especially when computing in higher
dimensions.

Recent applications in engineering use real Clifford (geometric) algebras like
$C\ell_{8,2}$ when modeling geometric transformations in robotics |fl"3l. Thus, there is a
need for efficient and fast symbolic computations which not only take advantage of
the mathematical theory, for example by using the periodicity theorems, but also
take full advantage of recent multi-core hardware and software models supporting
parallel computing.

In this note, we present an experimental procedure cmulWpar from CLIFFORD
which utilizes the threading module available in Maple 15 and later. Maple supports
a coarse-grained, task-based model for parallel computing, which abstracts away the
need to deal with threads, locks, and other low-level constructs. For now, the
procedure cmulWpar computes the Clifford product of two arbitrary symbolic
elements of type clipolynom in the real Clifford algebra $C\ell_{p,q}$ for a diagonalized
quadratic form Q.

In CLIFFORD the user can choose between two algorithms to compute the Clifford
product, or supply his/her own routine (not necessarily computing the Clifford
product). The two main procedures are cmulNUM and cmulRS, which compute the
Clifford product of any two basis monomials of type clibasmon in Clifford algebras
$C\ell(B)$ of any bilinear form B. The former is based on Chevalley's recursive
definition of the product and usually performs better for bilinear forms with numeric
entries, especially if many of them are zero. The latter is based on the Hopf
algebraic Rota-Stein cliffordization process and computes faster on fully symbolic
bilinear forms. These routines are highly optimized for speed as they use internal
features of Maple like hashing of already computed results (using the remember
option). Although we have succeeded in parallelizing them after making all
procedures internal to them thread-safe, in this note we concentrate on computations
in real Clifford algebras $C\ell_{p,q}$ of a non-degenerate quadratic form and the simpler
cmulWpar procedure.

A third experimental procedure available to CLIFFORD is cmulW. It belongs
(for now) to the walsh package developed by the authors. cmulW uses binary coding of
basis elements and Walsh functions, see for example [15], to compute the Clifford
product of any two basis monomials in $C\ell_{p,q}$ for a quadratic form of signature (p,q)
in an orthogonal basis. We would like to recall that CLICAL, a stand-alone semi-symbolic
"calculator" for $C\ell_{p,q}$ designed by Lounesto et al. lfl2l[T6l already in 1987, was
based on binary coding and Walsh functions for internal data handling and storage.

In Section 2 we display and briefly discuss the code of cmulW and cmulWpar,
which internally uses cmulW for the product of any two basis monomials. We
describe a mechanism in CLIFFORD which permits the user to select which of the
procedures cmulNUM, cmulRS, cmulW, or even a user-provided procedure, is used
internally by the active, non-parallel, top-level procedure cmul furnishing the
Clifford product in $C\ell(B)$ (the first two) or $C\ell_{p,q}$ (the third one). Then, we benchmark
the procedures, namely, the parallel cmulWpar against the sequential cmulW, cmul
with cmulRS, and cmul with cmulNUM, for some test computations of the most
general Clifford polynomials in $C\ell_{p,q}$ for $p + q \le 9$. The complete and well commented
code of all Maple worksheets showing these computations, including parallelized
cmulNUM and cmulRS, is available at [4].

1 In worksheets posted at [4], we show how to extend this multi-threading to quantum Clifford
algebras $C\ell(B)$ of an arbitrary bilinear form B.



2 Code of cmulW and cmulWpar

2.1 The Clifford product based on Walsh functions

First, we present the code of cmulW, which we use later in the parallel procedure
cmulWpar. The latter procedure relies on several other procedures, which we
do display here for the sake of completeness, and which handle things like producing
the Clifford product on basis monomials (Walsh) and the data conversion
(convert(<bas>, <data-type>)) between CLIFFORD's internal data structures for
basis monomials and their representations as binary tuples used by the oplus and
Walsh procedures. As cmulRS and cmulNUM do not have to perform these conversions,
there is a slight loss of speed here due to the data conversion. The procedure twist
provides the proper sign factor due to the grading, which is easily computed from the
binary (Gray code) representation of the Clifford monomials.

Listing 1 Clifford product on basis monomials eI, eJ using Walsh functions in $C\ell_{p,q}$

    cmulW := proc(eI::clibasmon, eJ::clibasmon,
                  B1::{matrix, list(nonnegint)})
      local a, b, ab, monab, Bsig, flag, i, dim_V_loc, ploc, qloc,
            _BSIGNATUREloc;
      # -- this procedure depends on external variables
      global dim_V, _BSIGNATURE, p, q;
      if type(B1, list) then
        ploc, qloc := op(B1);
        dim_V_loc := ploc + qloc:
        _BSIGNATUREloc := [ploc, qloc]:
      else
        ploc, qloc := p, q;   ###<<< -- this reads global p and q
        dim_V_loc := dim_V:   ###<<< -- this reads global dim_V
        _BSIGNATUREloc := [ploc, qloc]:
        if not _BSIGNATURE = [ploc, qloc] then _BSIGNATURE := [p, q] end if:
      end if:
      # -- data structure conversion: string to binary
      a, b := convert(eI, clibasmon_to_binarytuple, dim_V_loc),
              convert(eJ, clibasmon_to_binarytuple, dim_V_loc);
      # -- mod 2 binary addition
      ab := oplus(a, b);
      # -- data structure conversion: binary to string
      monab := convert(ab, binarytuple_to_clibasmon);
      return
        twist(a, b, _BSIGNATUREloc) * Walsh(a, hinversegGrayCode(b)) * monab;
    end proc:
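
The binary coding behind cmulW can be illustrated outside Maple. The following Python sketch is not the authors' implementation: it assumes that a basis monomial is stored as an integer bitmask (bit i-1 is set when e_i occurs), and it folds the roles of twist and Walsh into a single sign obtained by counting transpositions and negative squares. The names reorder_sign and blade_product are hypothetical. The XOR of the two bitmasks plays the role of the mod 2 addition oplus.

```python
def reorder_sign(a: int, b: int) -> int:
    """Sign picked up when the generators of blade b are sorted into blade a.

    Counts pairs (i in a, j in b) with i > j; each such pair is one
    transposition of anticommuting generators.
    """
    swaps = 0
    a >>= 1
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps & 1 else 1


def blade_product(a: int, b: int, p: int, q: int):
    """Return (sign, blade) such that e_a * e_b = sign * e_blade in Cl(p,q).

    Assumption of this sketch: the first p generators square to +1 and the
    next q generators square to -1 (orthogonal basis).
    """
    n = p + q
    if a >> n or b >> n:
        raise ValueError("blade uses a generator outside dimension p+q")
    sign = reorder_sign(a, b)
    # generators occurring in both blades contract to their squares
    negative = ((1 << q) - 1) << p
    if bin(a & b & negative).count("1") & 1:
        sign = -sign
    return sign, a ^ b   # XOR is the mod 2 addition of the binary tuples


# e1 * e2 = e12 and e2 * e1 = -e12 in Cl(2,0)
print(blade_product(0b01, 0b10, 2, 0), blade_product(0b10, 0b01, 2, 0))
```

A table-free formulation like this one needs no conversion between string and binary representations, which is the conversion cost mentioned above for cmulW.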

2.2 Maple's threading mechanism for coarse grained parallel
computing

The following example is taken from Maple's help page ?Threads:-Task:-Start.
It explains how to split a computation into pieces when the computation is 'large'
enough to profit from parallel execution, and then execute the parallel tasks and
use a continuation function to produce the result. The example computes $\sum_{i=1}^{10^7} i$.

Listing 2 Task threading example 



    continuation := proc(a, b)   # add two results
      return a + b;
    end proc;

    task := proc(i, j)
      # distributes the computation into tasks
      local k;
      if (j - i < 1000) then
        # if the range is small, just compute
        return add(k, k = i..j);
      else
        # split computation into two parts
        k := floor((j - i)/2) + i;
        # produce two child tasks, by calling task recursively
        Threads:-Task:-Continue(continuation,
            Task = [task, i, k], Task = [task, k+1, j]);
      end if;
    end proc;

    # compute sum 1..10^7 parallel and using add
    Threads:-Task:-Start(task, 1, 10^7) = add(i, i = 1..10^7);



The parallelism is coarse-grained: the user does not have to deal with threads and,
for a large part, with locks. However, the routines involved have to be programmed
in a thread-safe fashion. Since we want to demonstrate how to parallelize the Clifford
product cmulW in $C\ell_{p,q}$, in Section 2.3 we will discuss only cmulWpar.
The code of cmulNUMpar and cmulRSpar will be available in the worksheets [4]
accompanying this paper.
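
For readers without Maple, the split-then-combine pattern of the task example can be sketched in Python. This is only an analogy under stated assumptions: the recursion of Threads:-Task:-Continue is flattened into a list of leaf ranges, the continuation is replaced by a final sum, and the helper names split_range, leaf_sum, and parallel_sum are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(i, j, threshold=1000):
    """Halve [i, j] repeatedly until every piece satisfies j - i < threshold."""
    pending, leaves = [(i, j)], []
    while pending:
        lo, hi = pending.pop()
        if hi - lo < threshold:
            leaves.append((lo, hi))          # small enough: compute directly
        else:
            mid = (hi - lo) // 2 + lo        # same midpoint rule as the Maple task
            pending.append((lo, mid))
            pending.append((mid + 1, hi))
    return leaves


def leaf_sum(bounds):
    """Work done by one leaf task: add the integers lo..hi."""
    lo, hi = bounds
    return sum(range(lo, hi + 1))


def parallel_sum(i, j, threshold=1000, workers=4):
    """Sum i..j by mapping leaf ranges over a thread pool and adding the results."""
    leaves = split_range(i, j, threshold)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(leaf_sum, leaves))


print(parallel_sum(1, 10**6) == 10**6 * (10**6 + 1) // 2)
```

In CPython the global interpreter lock prevents pure-Python leaf tasks from running simultaneously, so this sketch shows the structure of the computation rather than a speedup.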



2.3 The parallel procedure cmulWpar for the Clifford product

We briefly discuss the code of cmulWpar, the parallelized version of the Clifford
multiplication based on the Walsh-function core multiplication of the Clifford
monomials used in cmulW, shown in Section 2.1. This code needs at least Maple
15, and it has been tested in Maple 15 and Maple 16.

2 See im . 

3 The highly optimized routines cmulRS and cmulNUM turned out initially not to be thread-safe
because they relied on an internal function which was not defined in a thread-safe manner. After
fixing this problem, both have now been successfully parallelized.

The idea is to implement the Clifford multiplication along the lines of the example
given above in Listing 2. As we parallelize a procedure with two arguments,
we need to deal with each argument separately, which slightly complicates the
procedure.

Listing 3 Parallelized version cmulWpar of the Clifford product using cmulW 

    cmulWpar := proc(x, y)
      local i, cf, term, co1, lst1, co2, lst2, lst, task, addUp; global p, q;
      # -- process x: turn Clifford polynomial x into a
      #    list of type ::list[coeff, monom]
      if type(x, `+`) then   # x is a sum
        lst := [op(x)]; lst1 := []:
        for i in lst do   # split each term into [coeff, monom]
          if type(i, clibasmon) then
            lst1 := [op(lst1), [1, i]];
          else
            term, cf := selectremove(type, i, clibasmon);
            lst1 := [op(lst1), [cf, term]];
          end if;
        end do;
      elif type(x, `*`) then   # x is a term
        term, cf := selectremove(type, x, clibasmon);
        lst1 := [[cf, term]];
      else   # x is a monom
        lst1 := [[1, x]];
      end if;
      # -- process y: turn Clifford polynomial y into a
      #    list of type ::list[coeff, monom]
      if type(y, `+`) then   # y is a sum
        lst := [op(y)];
        lst2 := []:
        for i in lst do   # split each term into [coeff, monom]
          if type(i, clibasmon) then
            lst2 := [op(lst2), [1, i]];
          else
            term, cf := selectremove(type, i, clibasmon);
            lst2 := [op(lst2), [cf, term]];
          end if;
        end do;
      elif type(y, `*`) then   # y is a term
        term, cf := selectremove(type, y, clibasmon);
        lst2 := [[cf, term]];
      else
        lst2 := [[1, y]];   # y is a monom
      end if;
      #====================================================
      # -- set up multitasking
      # -- continue function, add up results of task processes
      addUp := proc(a, b) a + b end proc:
      # -- task definition
      task := proc(lst1, lst2)
        local i, j, packsize, lst, lstpair;
        packsize := 4;   # size for sequential processing
        # -- if x and y are small, just compute
        if max(nops(lst1), nops(lst2)) <= packsize then
          add(add(lst1[i][1]*lst2[j][1]*cmulW(lst1[i][2],
              lst2[j][2], [p, q]), i = 1..nops(lst1)), j = 1..nops(lst2));
        # -- split the larger list for parallel processing
        elif nops(lst1) < nops(lst2) then   # process lst2
          # -- split lst2 (y)
          lstpair := lst2[1..packsize], lst2[packsize+1..-1];
          # -- produce two new tasks for the split list lst2
          Threads:-Task:-Continue(addUp,
              Task = [task, lst1, lstpair[1]],
              Task = [task, lst1, lstpair[2]]);
        else   # process lst1
          # -- split lst1 (x)
          lstpair := lst1[1..packsize], lst1[packsize+1..-1];
          # -- produce two new tasks for the split list lst1
          Threads:-Task:-Continue(addUp,
              Task = [task, lstpair[1], lst2],
              Task = [task, lstpair[2], lst2]);
        end if;
      end proc:
      # -- start computation and collect results
      Threads:-Task:-Start(task, lst1, lst2);
    end proc:





The procedure cmulWpar starts off by processing the inputs x, y, which may be
Clifford polynomials. First, it splits x into a list lst1 of lists of type ::list[coeff, monom],
where coeff is a base ring element and monom is a Clifford basis
monomial $e_I$. This splitting is done using the type clibasmon from CLIFFORD.
Similarly, y is split into the list lst2. This conversion could be made external by
defining a new (external) procedure; however, as CLIFFORD deals internally
differently with (multi)linearity, we keep it inline here, also saving two function calls.
The signature (p,q) of the current quadratic form is passed on to cmulW through two
Maple variables p and q which are declared "global" to cmulWpar.

The parallel processing starts with the definition of addUp, which adds the results
later provided by two tasks. The main routine is task, which operates on pairs of
lists of type ::list[coeff, monom]. In effect, we pass Cartesian products of the two
lists here. Maple provides in the combinat package a way to pass an iterator, which
saves memory. However, for reasons of thread safety we have refrained from using this
device so far. The parameter packsize sets the threshold above which parallel
processing is applied. If both lists are small compared to packsize, task just computes
the result directly, as in the summation example in Listing 2. Otherwise, one of
the lists is 'large' and we split the larger list recursively to produce two new tasks. To
do so we use the Threads:-Task:-Continue(...) function. This proceeds until
both lists are 'small', at which point the products are actually computed in their
respective threads. Finally, the Threads:-Task:-Start(...) routine initializes the
threading mechanism, starts running the tasks in separate threads, and collects the results.

The number of tasks produced is also the number of threads Maple produces.
On a 4-core cpu one would like to have 4 threads only, all taking equal time to
compute. For that reason we should compute the parameter packsize dynamically.
The Maple procedures like Add, Seq, etc., do this. At the moment we use a static
packsize and have to compromise between an optimal number of threads and losing
parallelism altogether. Experiments show that the input and the dimension of $C\ell_{p,q}$
have a large impact on a good choice for packsize; a reasonable setting is about 16.
However, let us consider two Clifford polynomials x, y with 1,000 terms each. Then the
above procedure with packsize=16 will produce roughly $(1000/16)^2 \approx 3{,}906$ tasks and
hence as many threads. This clearly contradicts the idea of coarse-grained parallelism
and calls for a dynamical setting of packsize.
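
The task count quoted above can be checked by simulating the splitting rule of Listing 3 on list lengths alone. The Python sketch below is an illustration, not part of CLIFFORD: the function name count_tasks is hypothetical, and only the lengths of lst1 and lst2 are tracked, not their contents. A leaf task is one in which both lists have at most packsize entries; every other task spawns two children.

```python
def count_tasks(n1: int, n2: int, packsize: int):
    """Return (leaf_tasks, total_tasks) for term lists of lengths n1 and n2."""
    leaves = total = 0
    pending = [(n1, n2)]                 # explicit stack instead of recursion
    while pending:
        a, b = pending.pop()
        total += 1
        if max(a, b) <= packsize:        # both lists small: compute directly
            leaves += 1
        elif a < b:                      # second list is larger: peel off a pack
            pending.append((a, packsize))
            pending.append((a, b - packsize))
        else:                            # first list is larger (or equal)
            pending.append((packsize, b))
            pending.append((a - packsize, b))
    return leaves, total


leaves, total = count_tasks(1000, 1000, 16)
print(leaves, total)
```

Since each leaf covers at most 16 x 16 = 256 of the 1,000,000 term pairs, at least 3,907 leaves are needed, in line with the estimate in the text; the total number of tasks, including the inner splitting tasks, is 2*leaves - 1.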

As long as all procedures involved are thread-safe, that is, they can be used in
parallel without side effects that carry a risk of miscomputing, parallelizing
Maple procedures is formally straightforward. However, to achieve efficiency one
needs some understanding of what is going on internally.



3 Benchmarking results for cmulWpar versus sequential
multiplications cmulW, cmulRS and cmulNUM

Benchmarking Maple procedures is not an easy task, and it becomes even more
complicated if threads and parallel computing are in use. Firstly, Maple has a garbage
collector, and when it runs is largely out of the user's control. Using 'cputime' as a
measure, one needs to take into account that Maple adds up the cpu times on different
cores. So, running a procedure on 2 cores for 3 seconds each will be reported as 6
seconds of cpu time usage. Because of the administrative overhead, parallel
computations will take more cputime than single-threaded computations.

The second possibility is to use 'realtime', which measures the clock time when
executing a process. If the two processes above run in parallel, then this should
take 3 seconds, but Maple has to share the processor with the operating system
and possibly other applications which are currently running. Hence, benchmarking has
to be done on a clean idle system to get reproducible and comparable results, and
this is what we have ensured to be the case. A useful tool for such benchmarking
is the CodeTools:-Usage(...) procedure of Maple. However, the user should be
warned that the profile and CodeTools packages of Maple are not yet thread safe, and
profile in particular at times shows even negative run times.

We have tested the parallelized Clifford product given above on two machines.
The first one is a dual core Windows XP (SP3) machine with an Intel(R) Core(TM) 2
Duo CPU at 2.19 GHz and 2.9 GB RAM. The second machine is a Core i7-2640QM at
2.8-3.5 GHz with 8 GB RAM running Ubuntu 11.10 Linux; it has a physical dual
core with 4 hyper-threading virtual cores seen by Linux. The two machines give, up
to a scaling factor of about 2, the same results, so we show only one set of data here.
See the appendix for Maple's perception of how many cores are available.

Table 1 Benchmarking CPU times: t1 of cmulWpar, t2 of cmulW, t3 of cmulRS, and t4 of cmulNUM

    dim V   t1 [sec]   t2 [sec]   t2/t1   t3 [sec]   t3/t1   t4 [sec]   t4/t1
      2                0.063      ∞       0.047              0.047      ∞
      3     0.046      0.172      3.7     0.250      5.44    0.203      4.41
      4     0.250      0.766      3.1     0.828      3.31    0.984      3.93
      5     1.094      3.000      2.74    4.594      4.20    5.532      5.05
      6     4.687      14.68      3.13    36.09      7.7     43.61      9.3
      7     22.000     136.0      6.18    506.1      22.9    448.6      22.5
      8     112.47     NA         NA      NA         NA      NA         NA
      9     647.20     NA         NA      NA         NA      NA         NA

Having a dual or quad core available, we can expect a theoretical speedup of at
most a factor of 2 or 4, which in practice suffers from the administrative overhead of
the threading software. The speedups given in Table 1 therefore cannot be attributed to
parallelization alone, but must also be partially caused by the different ways the
procedures compute and by the data structures involved. It looks as if thinking about
the threading model leads one to more efficient code per se. A rough measure to
check whether parallelization really occurs is given by

a) checking the system load during computation to see if all cores are running under
full load,

b) computing the quotient cputime/realtime, which roughly measures the 'effective
number of cores in use'.
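
The quotient in item b) is easy to reproduce in any environment that exposes both clocks. The Python sketch below is an analogy only, not the Maple measurement: it assumes that time.process_time (CPU time summed over all threads of the process) stands in for 'cputime' and time.perf_counter for 'realtime'; the helper name effective_cores is hypothetical.

```python
import time


def effective_cores(func, *args):
    """Run func(*args) and return (result, cpu_seconds, wall_seconds, ratio).

    ratio = cpu / wall estimates the 'effective number of cores in use':
    close to 1 for single-threaded work, larger when several cores are
    busy, smaller when the process mostly waits.
    """
    wall0 = time.perf_counter()
    cpu0 = time.process_time()
    result = func(*args)
    cpu = time.process_time() - cpu0
    wall = time.perf_counter() - wall0
    ratio = cpu / wall if wall > 0 else float("nan")
    return result, cpu, wall, ratio


result, cpu, wall, ratio = effective_cores(sum, range(200000))
print(result, round(ratio, 2))
```

As with the Maple figures, such a ratio is only meaningful on an otherwise idle machine, since other processes inflate the wall-clock time.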

The ratio cputime/realtime varies a lot with the input to cmulWpar; it seems
to range from 0.9 to 1.8 on the second machine described above. Maple allows one to set
the number of cores (called cpus in Maple) in use. In this way one can, with care
(see the appendix), benchmark the parallelized threaded code on a single core versus
several cores. As we do not have quad- and octo-core machines available, we cannot
demonstrate such results, which would really show how the threading mechanism
scales with the number of cores.

In Table 1 we summarize some of our benchmarking results. We have computed
Clifford products of two most general Clifford polynomials X and Y in the Clifford
algebras $C\ell_{p,0}$ for $p \le 9$. The table shows CPU times in seconds as they were
returned by the Maple procedure CodeTools:-Usage(...), where t1 is the CPU
time taken to compute such a product by the parallel procedure cmulWpar, whereas
t2, t3, and t4 are the CPU times needed by the ordinary, non-parallel Clifford product
procedure cmul using internally cmulW, cmulRS, and cmulNUM, respectively. The
computations in dimensions 8 and 9 with cmulW, cmulRS, or cmulNUM were not
completed because they ran out of RAM. For example, in dimension 9 they
were stopped after 7.8 GB of RAM had been consumed and the machine started to
swap.



4 These times were obtained on Windows XP (SP3) machine with Intel (R) Core (TM) 2 Duo CPU 
2.19 GHz and 2.9 GB RAM. 
4 Conclusions

We have shown how to use Maple's task model, a coarse-grained parallel computing
framework, to parallelize the Clifford product for the CLIFFORD package. As long
as such computations are free of side effects and thread safe, this is easily achieved by
using the Threads package from Maple.

We hit some problems when trying to parallelize the core multiplication procedures
cmulRS and cmulNUM, which are highly optimized mathematically in their
algorithms and, on the software side, by extensive hashing of precomputed results
(Maple remember tables). However, after removing some internal procedures
these are now thread safe too, and now have parallel versions cmulNUMpar and
cmulRSpar.

In this paper we have discussed the very fast procedure cmulW for computing the
Clifford product in orthogonal bases and arbitrary signature, as it was already
thread-safe and allowed immediate parallelization. Surprisingly, the parallelized version
is much faster than the theoretical limit allows, so this speedup is not solely due to
our parallelizing the computation. It seems that dealing with the threading package
of Maple forced us to produce more efficient code, especially for the multilinear
features. We have provided detailed benchmark results showing the speedups and
also discussed how to separate the speedup due to the different coding from that
coming from actual parallel computing. We find a speedup due to parallelization of
up to 1.8 on large Clifford polynomials on a dual core machine, which is what one can
expect and shows that parallel computing is by now feasible in symbolic computer
algebra. This is possibly good news for engineers and roboticists who do computations
in higher dimensional Clifford algebras like $C\ell_{8,2}$ for geometric computations.

Thus, we must be cautious when examining the speedup ratios t3/t1 and t4/t1
shown in Table 1. We caution the reader again that, for example, the speedup
factor of around 22 in dimension 7 cannot be attributed exclusively to the
parallelization process. This speedup is a combination of factors, due to, for
example, making the overhead in cmulWpar dealing with the bilinearity much smaller
and faster than the resource- and time-demanding procedure clibilinear from
CLIFFORD. The latter includes, among other things, time-consuming type checking
of the input on many levels of recursion, especially when cmulNUM is used. At the
moment we do actually profit from multi-threading, as seen by computing the number
of effective cores cputime/realtime ≈ 0.9-1.5. However, we saw that a bit of
reorganization of the data structures and the recursive way of doing the products (saving
memory) gives us an even more substantial speedup. It is questionable if
multi-threading is only valuable when one knows that one's code is at its theoretical limit
with respect to space and time complexity, and CLIFFORD is not yet at that limit.
We suspect that CLIFFORD could be faster at least by an overall factor of more
than 20-30, based on this current experience, by a generic rewrite using better data
structures, avoiding all the repetitious parsing and type checking where it can
be avoided, and using the recursive way to split (multi)linearity, etc. Optimizing
CLIFFORD and its related packages like Bigebra, Cliplus, Octonion, etc. [3]
is a priority whose urgency has been emphasized by this exercise in parallelizing
the Clifford product.

The results discussed here are accompanied by Maple worksheets posted on [4].
These well-documented worksheets contain further results and alternatives, such as
using the inherently parallel procedures Add, Seq, Map of Maple or directly producing
threads. There we further discuss the efficient usage of Maple's Threads package.
We are working to make all of CLIFFORD thread safe now that we have succeeded in
parallelizing the more complex cmulRS and cmulNUM routines. While
cmulRS is based on a provably optimal algorithm, the above discussion still sheds
some light on the efficiency of the implementations due to different data structures or
recursive computing models (saving memory usage). In that respect, this is a very
open area of research.

Acknowledgements Bertfried Fauser wants to thank Darin Ohashi from Maplesoft for his kind 
help with and email discussions about Maple's threading mechanism. Both authors thank referees 
for their comments as they have helped us extend this work and improve its clarity. 



Appendix 

In this appendix we make some remarks about programming practice using the
different ways of parallelizing code in Maple: the threading module of Maple
(from version 15 onwards) and the inherently parallel routines such as Add, Mul,
Map, and Seq.



4.1 Thread safety 

The Threads package was introduced in Maple 15 and it was improved for Maple
16. Still, large parts of Maple are not yet 'thread safe', that is, the code cannot be
run in parallel as it may cause side effects which can interfere with other threads.
A common source of such problems is global variables and name space conflicts.
For example, Maple's parser will not complain when a running variable in
add(f(i), i=1..10); or seq(i^2, i=1..10); is not declared local: it will simply
miscompute. The following procedure dummy will miscompute in a threading
environment unless the variables i and j are explicitly declared local to dummy.

Listing 4 Local running variables have to be declared local 

    dummy := proc(x::List[expression], N::Integer)
      # local i, j;   # <== needed!
      # assignment of j produces a warning if not declared local
      j := x[1];
      add(x[i], i = 1..N);
    end proc:

Note that Maple's parser does not complain about an undeclared local variable, so
this issue slips through unnoticed, as it will for the (here unused) variable j if not
declared local.

The next issue is more subtle. If in a procedure dummy one has a helper procedure
fun, declared local to it, and if this function is defined with a remember table,
either by option or by assigning certain values to it, then these assignments are seen
globally, hence are visible to all threads! In effect, any thread can access values set
by other threads, and this will ultimately lead to errors.

Listing 5 Avoid procedures local to a procedure 

    # NOT thread safe procedure
    dummy := proc(x::List[expression], N::Integer)
      local fun, i;
      # -- local helper procedure
      fun := proc(y::any)
        # -- either remember here
        option remember;
        return 3*y^2 + 4;   # just some computation
      end proc:
      # -- or use special cases (implements a remember table)
      fun(x[1]) := x[1];
      fun(x[N]) := x[N];
      add(fun(x[i]), i = 1..N);
    end proc:

CLIFFORD used such a construction to implement permutations in reordering
wedge products of Grassmann basis monomials, and this rendered cmulNUM and
cmulRS not thread safe at first. The reordering had to be done sequentially or it
needed to be done differently.

A further issue with threading comes from the fact that the programmer must
check, for every procedure the given package uses, whether it is 'thread safe'. Each
Maple procedure has a help page, and there it is marked if this particular procedure
is thread safe, and from which version of Maple onwards. If no such statement
is given, one must assume that the procedure is not thread safe. A suggestion of
the referee to use the combinat:-cartprod construction to iterate over Cartesian
products of sets cannot be realized, as this function is not (yet) declared thread
safe by Maple.

Unfortunately for benchmarking, neither profile nor the CodeTools
package is thread safe yet either. While profile seems to be broken, as it
occasionally reports negative running times, the function CodeTools:-Usage seems to
be reasonably stable. For that reason we used this utility to do our benchmarks, but
also checked the results via external timing.

4.2 Overhead versus gain in parallel code

Given the example from Listing 2, one observes that just adding integers in parallel is
slower than doing so sequentially. A similar result is obtained if one uses the parallel
code snippet Add(i, i=1..10^7);. This shows clearly that the overhead introduced
by Maple to produce threads is too large to provide any gain in speed. However, the
situation changes when one computes a more complicated sum, e.g., $\sum_{i=1}^{10^7} i^{2/3}$
evaluated as a float. The code snippet looks like Add(evalf(i^(2/3)), i=1..10^7);.
Similar effects are encountered when one uses threads directly. To benefit from
parallelizing code in Maple, one has to make sure that the work done in a single task is
as large as possible, and that reflects the idea of coarse-grained parallelism
implemented by Maple.

Another suggestion of the referee, to use Add, Mul, Seq, or Map inside a thread, is
also not advisable. On a processor with n cores (cpus), once n threads have already
been produced by other means, further parallelizing will only create superfluous
threads which cannot be processed in parallel, since all cpus are already busy running
the threads assigned to them.

Finally, Maple 15 and Maple 16 differ in how the Threads package sets
the number of cpus. The call kernelopts(numcpus); reports the number
of 'cpus' Maple sees and uses for threading. Maple 15 sets the number of
cpus to the number of virtual cores of a physical cpu (for example, 4 for a Core
i7-2640QM with 2 physical and 4 virtual hyper-threading cores), while Maple 16
uses the number of physical cores, that is 2 here. However, on modern Core i7
processors the virtual cores allow a substantial speedup due to interlocking processes
when queues and/or pipelines are filled etc. It is therefore advisable to set the
number of cpus at the beginning of the worksheet to the number of virtual cores by
kernelopts(numcpus=4); for 4 virtual cores. Note that when the threading
mechanism is first used, it is initialized using this number, and the number of cpus
cannot be reset later (despite what kernelopts(numcpus); later reports).



4.3 Other parallelization mechanisms of Maple 

As we have already mentioned, the creation of tasks is not the only way you can 
use Maple's threading package. We have investigated if different approaches give 
significantly different results. 

A first group of seemingly simple to use procedures are Add, Mul, Seq, and Map, 
which parallelize the corresponding sequential (lower case) procedures. The advan- 
tage is that Maple does some load balancing in computing how many threads are 
created and that it is very simple to use these procedures. As the previous section 
shows, one is nevertheless left with benchmarking these procedures as a too naive 
usage may result in slower code. Given the Clifford product we were able to get 
roughly a similar speedup as with the method described above, which does not yet 
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use a dynamical setting of packsize. But using Add to sum up the terms in a Clif- 
ford product is very memory intensive, and this favors other solutions. 

The second method is to use the task model via the Threads:-Task package, which
provides the procedures Start, Return (to leave a task), and Continue. As we
have chosen this model above, there is not much to add here.
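For readers who want to see the task model in isolation, the following is a minimal sketch that sums the integers from 1 to 10^7 by recursive splitting. It is not the code of cmulwpar; the procedure name sumRange and the cutoff 10^6 are hypothetical choices made for this illustration.

```maple
# Hypothetical helper: sums the integers lo..hi using the task model.
sumRange := proc(lo, hi)
    local mid, i;
    if hi - lo < 10^6 then
        # Small range: sum sequentially. The returned value is passed
        # on to the continuation of the parent task.
        return add(i, i = lo .. hi);
    else
        mid := iquo(lo + hi, 2);
        # Create two child tasks; once both have finished, the
        # continuation procedure adds their results.
        Threads:-Task:-Continue(
            proc(a, b) a + b end proc,
            Task = [sumRange, lo, mid],
            Task = [sumRange, mid + 1, hi]);
    end if;
end proc:

# Start the first task and wait for the final result.
total := Threads:-Task:-Start(sumRange, 1, 10^7);
```

The cutoff plays a role comparable to a work-package size: if it is too small, the overhead of creating tasks dominates, and if it is too large, too few tasks exist to keep all cores busy.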

A third way to create threads is to use the Threads:-Create command directly.
Maple provides locks, mutexes, and synchronization via Threads:-Wait to deal
with these threads directly. For example, to compute the sum of integers
$\sum_{i=1}^{10^7} i$ in 2 threads one can use this code:

restart:

# -- define two functions performing the work
p1:=proc(x) local i; add(i,i=1..5*10^6) end proc:
p2:=proc(x) local i; add(i,i=5*10^6+1..10^7) end proc:

# -- create 2 threads executing the work
id1:=Threads:-Create(p1(),out1);
id2:=Threads:-Create(p2(),out2);

# -- wait for the two threads id1, id2 to be finished
Threads:-Wait(id1,id2);

# -- produce the result
out1+out2;

The above code produces the output id1=1, id2=2 (showing the ids of these
threads) and the numerical sum 50000005000000. This approach clearly gives the
most direct access to the threading mechanism, as the programmer controls
every aspect of the threads explicitly. This model is especially useful when
several threads have to share resources, as one can lock variables, etc., using
the ConditionVariable and Mutex packages. We have benchmarked a version of
cmulwpar using this direct method (using numcpus=4 threads) and again obtained
essentially the same performance results. Given this finding, it seems advisable
to use in Maple the easiest threading model available for the task at hand, as
we have demonstrated.



References 

1. Ablamowicz, R.: Clifford algebra computations with Maple. In: Baylis, W. E. (ed.) Clifford
(Geometric) Algebras with Applications in Physics, Mathematics, and Engineering, pp. 463-502,
Birkhäuser, Boston (1996)

2. Ablamowicz, R.: Computations with Clifford and Grassmann algebras. Adv. Applied Clifford
Algebras 19, No. 3-4, 499-545 (2009)

3. Ablamowicz, R., Fauser, B.: CLIFFORD with Bigebra - A Maple Package for Computations with
Clifford and Grassmann algebras (2012), http://math.tntech.edu/rafal/ Cited June 10, 2012

4. Ablamowicz, R., Fauser, B.: Maple worksheets created with CLIFFORD for verification of the
results presented in this paper (2012), http://math.tntech.edu/rafal/publications.html
Cited June 10, 2012

5. Ablamowicz, R., Fauser, B.: Using periodicity theorems for computations in higher
dimensional Clifford algebras. Submitted (2012)



6. Ablamowicz, R., Fauser, B.: On the transposition anti-involution in real Clifford algebras
III: the automorphism group of the transposition scalar product on spinor spaces. Linear and
Multilinear Algebra. Online version: iFirst, DOI 10.1080/03081087.2011.624093 (2011)

7. Ablamowicz, R., Fauser, B.: On the transposition anti-involution in real Clifford algebras
II: stabilizer groups of primitive idempotents. Linear and Multilinear Algebra, Vol. 59,
No. 12, 1359-1381 (2011)

8. Ablamowicz, R., Fauser, B.: On the transposition anti-involution in real Clifford algebras
I: the transposition map. Linear and Multilinear Algebra, Vol. 59, No. 12, 1331-1358 (2011)

9. Ablamowicz, R., Fauser, B.: Clifford and Grassmann Hopf algebras via the Bigebra package
for Maple. Computer Physics Communications 170, 115-130 (2005), math-ph/0212032

10. Ablamowicz, R., Fauser, B.: Mathematics of CLIFFORD - A Maple package for Clifford and
Grassmann algebras. Adv. in Applied Clifford Algebras 15, No. 2, 157-181 (2005),
math-ph/0212031

11. Ablamowicz, R., Fauser, B.: Hecke algebra representations in ideals generated by q-Young
Clifford idempotents. In: Ablamowicz, R., Fauser, B. (eds.), Clifford Algebras and their
Applications in Mathematical Physics, Vol. 1: Algebra and Physics, Birkhäuser, Boston,
pp. 245-268 (2000)

12. Ablamowicz, R., Sobczyk, G.: Software for Clifford (geometric) algebras. Appendix in:
Ablamowicz, R., Sobczyk, G. (eds.), Lectures on Clifford Geometric Algebras and Applications,
pp. 189-209, Birkhäuser, Boston (2004)

13. Bayro-Corrochano, E.: Private communication (2012)

14. Hitzer, E., Helmstetter, J., Ablamowicz, R.: Square roots of -1 in real Clifford algebras.
http://arxiv.org/abs/1204.4576 To appear (2012)

15. Lounesto, P.: Clifford Algebras and Spinors, 2nd ed. Cambridge University Press, Cambridge
(2001)

16. Lounesto, P., Mikkola, R., Vierros, V.: CLICAL User Manual: Complex Number, Vector Space
and Clifford Algebra Calculator for MS-DOS Personal Computers. Helsinki University of
Technology, Institute of Mathematics, Research Reports A248, August (1987)

17. Maple 15 and 16 from Maplesoft, Waterloo Maple Inc., Waterloo, Ontario,
http://www.maplesoft.com/ Cited June 10, 2012

18. Porteous, I.: Clifford Algebras and the Classical Groups. Cambridge University Press,
Cambridge (1995)



